Abstract

Background: The large Glycoside Hydrolase family 5 (GH5) groups together a wide range of enzymes acting on 
β-linked oligo- and polysaccharides, and glycoconjugates from a large spectrum of organisms. The long and
complex evolution of this family of enzymes and its broad sequence diversity limits functional prediction. With the 
objective of improving the differentiation of enzyme specificities in a knowledge-based context, and to obtain new 
evolutionary insights, we present here a new, robust subfamily classification of family GH5. 

Results: About 80% of the current sequences were assigned into 51 subfamilies in a global analysis of all publicly 
available GH5 sequences and associated biochemical data. Examination of subfamilies with catalytically-active
members revealed that one third are monospecific (containing a single enzyme activity), although new functions 
may be discovered with biochemical characterization in the future. Furthermore, twenty subfamilies presently have 
no characterization whatsoever and many others have only limited structural and biochemical data. Mapping of 
functional knowledge onto the GH5 phylogenetic tree revealed that the sequence space of this historical and 
industrially important family is far from well dispersed, highlighting targets in need of further study. The analysis 
also uncovered a number of GH5 proteins which have lost their catalytic machinery, indicating evolution towards 
novel functions. 

Conclusion: Overall, the subfamily division of GH5 provides an actively curated resource for large-scale protein 
sequence annotation for glycogenomics; the subfamily assignments are openly accessible via the 
Carbohydrate-Active Enzyme database at http://www.cazy.org/GH5.html. 

Keywords: Protein evolution, Enzyme evolution, Functional prediction, Glycogenomics, Glycoside hydrolase family 
5, Phylogenetic analysis, Subfamily classification 



Background 

Carbohydrates, in the form of mono-, di-, oligo-, and 
polysaccharides, as well as glycoconjugates, play funda- 
mental roles in all forms of life [1]. Beyond their role in 
energy storage, carbohydrates are central to diverse bio- 
logical processes such as host-pathogen interactions, sig- 
nal transduction, inflammation, intracellular trafficking, 
diseases, and differentiation/development. Not least, as 
structural components of terrestrial biomass, carbohy- 
drates comprise approximately 75% of the carbon fixed 
annually by primary production [2]. Sugar-rich plant cell
walls, seeds, and tubers thus represent a renewable 
material with significant potential to address energy and
material needs. 

A striking feature of carbohydrates is their remarkable 
structural complexity, due to a rich diversity of mono- 
saccharide building blocks, and the possibility of numer- 
ous stereo- and regiospecific linkages [3], which give rise 
to both simple linear and complex, highly branched 
molecules [1]. A decade of investments in genomics and 
proteomics has greatly improved our interpretation of 
the molecular language of the cell, but deciphering the 
complex carbohydrate-based information in the biomo- 
lecular landscape is still in its infancy. Indeed, glycomics 
has been identified both as "the last frontier of molecu- 
lar and cellular biology" [4] as well as an "emerging tech- 
nology that will change the world" [5]. 

Functional analysis of glycans and glycoconjugates is 
complicated by the fact that they are not direct genetic 
products, but are instead synthesized, recognized, modified,
and degraded by a plethora of carbohydrate-active
enzymes (CAZymes) and binding proteins. In the syn- 
thetic direction, phosphosugar-dependent glycosyltrans- 
ferases (GTs) catalyze the formation of glycosidic 
linkages, whereas their breakdown is mediated by glyco- 
side hydrolases (GHs) and polysaccharide lyases (PLs), 
with the assistance of carbohydrate esterases (CEs). The 
structural diversity of carbohydrates is reflected in an 
abundance of CAZyme-encoding genes, which comprise 
1-3% of the genome of most organisms [6]. Expanding 
and harnessing knowledge of the complexity of the 
"CAZome" is thus essential to understanding the com- 
plexity of the glycome. 

The protein sequence-based classification of CAZymes 
was initiated in 1991 as a complement to the long- 
standing Enzyme Commission (EC) number system [7], 
which is based solely on enzyme activities [8]. Given the 
prevalence of convergent evolution of enzymes that 
cleave glycosidic bonds, as well as the demonstrable 
catalytic promiscuity of individual enzymes, sequence- 
based classification has proven to be a robust way to 
unify information on enzyme structure, specificity, and 
mechanism, which provides enormous predictive power 
[9]. Initially motivated by a need to delineate cellulases 
(EC 3.2.1.4) into distinct structural families [10], the first 
incarnation of the GH family classification, as such, 
comprised 35 GH families [8]. The number of families 
increased steadily with the growing interest in Glycobiol- 
ogy so that, as of August 2012, 130 sequence-based fam- 
ilies of GHs have been defined in the continuously 
updated CAZy database [11]. 

Presently, one of the largest GH families is GH5, his- 
torically known as "cellulase family A" as it was the first 
cellulase family described [10]. GH5 exemplifies a family 
with a large variety of specificities: it currently contains 
close to 20 experimentally determined enzyme activities 
denoted with an EC number. The abundance of GH5 
enzymes in different ecological niches has been high- 
lighted by their frequent identification in metagenomes 
of diverse microbial communities [12-14], as well as the 
genomes of individual organisms [11]. As with other 
CAZyme families [15], GH5 members are commonly 
found to be encoded as parts of multi-modular polypep- 
tide chains containing other catalytic, substrate-binding, 
and functionally unidentified or yet to be described 
modules. 

Within the large GH5 family, a discernible diversity of 
sequences was observed soon after its creation. The first 
five subfamilies of GH5 (A1-A5) were identified as early 
as 1990 [16]. Subfamily A6 was introduced in 1997 [17] 
and the following year eukaryotic and prokaryotic
β-mannanases were assigned to A7 and A8, respectively
[18]. Subsequently, subfamily A9 was introduced in a
study, which notably also suggested the merger of A5
and A6 [19]. Finally, A10 was the most recently defined 
GH5 subfamily [20], while new subfamilies that pres- 
ently lack a unique identifier have also been suggested 
[21,22]. Family GH5 belongs to clan GH-A, which pres- 
ently groups 19 GH families to form the largest set of 
evolutionarily related GH families described in CAZy 
thus far (a clan is a group of families that arise from a 
common but very distant ancestor; despite weak se- 
quence similarity, clan members share conserved protein 
fold and catalytic machinery). 

Families such as GH5 were originally defined with a 
very small number of sequences. With the accumulation 
of an increasing body of sequence data, the relationship 
between the original families has sometimes changed 
enough to merit reexamination of family membership. 
Very recently, detailed three-dimensional structural ana- 
lysis led to the reclassification of several GH5 sequences 
into family GH30 based on the organization of secondary
structural elements around the conserved (β/α)8 fold
of the catalytic module [23]. 

Given the continuing expansion in sequence numbers 
and the partial GH5/GH30 reclassification, it is clear 
that a global re-analysis of the subfamily division of GH5 
is now needed. The rapid accumulation of genomic data 
in the past decade revealed a complex and varied se- 
quence space, with the consequence that a substantial 
portion of GH5 family members are currently not 
assigned to any subfamily. This situation will only be- 
come worse as the rate of (meta)genomic sequencing 
continues to increase with phenomenal rapidity. Further, 
this flood of data will cause an increasing reliance on 
computer-based annotation, which necessarily requires a 
robust framework to produce meaningful functional pre- 
dictions. The division of CAZyme families into subfam- 
ilies based on phylogenetic analysis has been applied as 
a successful approach to meet this challenge: Subfamily 
classification of GH13, GH30 and all of the PL families 
has demonstrated that the majority of the defined sub- 
families were monospecific, thus indicating a signifi- 
cantly better correlation of substrate specificity between 
sequences at the subfamily level than the family level 
[23-25]. Significantly, the division into subfamilies allows 
the identification of currently uncharacterized subfam- 
ilies that can subsequently be analyzed biochemically 
and structurally to potentially unveil new activities. 

Hence, we present here an improved, robust subfamily 
classification for GH5 by employing a large-scale ana- 
lysis of all publicly available sequences. Our intention is 
that the introduction of this additional hierarchical level 
across this important GH family will serve to guide en- 
zyme discovery, structure-function analysis, and biocata- 
lyst improvement in post-genomic efforts. Not least, 
many enzyme activities relevant to biomass analysis and 
conversion are found in GH5 (e.g., cellulases, mannanases,
xylanases, galactanases, and xyloglucanases), as
are enzymes with biomedical applications [26]. Signifi- 
cantly, the present analysis unveiled a large number of 
sparsely or incompletely characterized subfamilies that 
may still hide a number of unsuspected activities and 
singular structural features. 

Results and discussion 

Our bioinformatics approach allowed the division of 
close to 2300 GH5 catalytic modules into 51 distinct 
subfamilies, as shown in the global phylogenetic tree 
(Figure 1 and Additional file 1: Figure S1); subfamily
information is summarized in Table 1. Subfamily naming
follows the procedure devised for GH13, where the fam- 
ily number is followed by an Arabic numeral that 
reflects the order of creation [24]: GH5_1 to GH5_53. 
This series is essentially continuous, with a few excep- 
tions due to historical reasons: All of the previously 
described subfamilies (A1-A10) have been re-identified
in the current investigation except for A3 and A4, which
are merged into a single subfamily, GH5_4, and A5 and
A6, which are unified in subfamily GH5_5 (Figure 1). To
maintain consistency with earlier literature, the re- 
identified historical subfamilies have retained the ori- 
ginal Arabic numeral. For example, the subfamily 
formerly known as A2 is hereby designated GH5_2. The 
absence of subfamilies GH5_3 and GH5_6 reflects the 
two fusion events involving the historical subfamilies 
described above [19]. 
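The naming rule described above can be sketched in a few lines of Python. This is only an illustration of the convention as stated in the text; the names `HISTORICAL_TO_NEW` and `subfamily_ids` are hypothetical and are not part of CAZy or of the original analysis.

```python
# Illustrative sketch of the GH5 subfamily naming convention described in the text.
# HISTORICAL_TO_NEW and subfamily_ids are hypothetical names chosen for this example.

# Historical "A" subfamilies keep their Arabic numeral; A3 + A4 and A5 + A6 are merged.
HISTORICAL_TO_NEW = {
    "A1": "GH5_1",
    "A2": "GH5_2",
    "A3": "GH5_4",  # merged with A4
    "A4": "GH5_4",
    "A5": "GH5_5",
    "A6": "GH5_5",  # merged with A5
    "A7": "GH5_7",
    "A8": "GH5_8",
    "A9": "GH5_9",
    "A10": "GH5_10",
}


def subfamily_ids(first=1, last=53, retired=(3, 6)):
    """Return identifiers GH5_<first>..GH5_<last>, skipping the retired numbers."""
    return [f"GH5_{n}" for n in range(first, last + 1) if n not in retired]


if __name__ == "__main__":
    ids = subfamily_ids()
    print(len(ids))                  # number of subfamilies in the numbering series
    print(HISTORICAL_TO_NEW["A6"])   # new identifier for historical subfamily A6
```

Skipping the numerals 3 and 6 in the series GH5_1 to GH5_53 leaves 51 identifiers, which agrees with the number of subfamilies reported here.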

In addition to the new and historical designations, the 
taxonomical range of the included sequences, experi- 
mentally determined enzyme activities and representa- 
tive 3-D structures are presented in Table 1 for each 
subfamily. Notably, all of the 33 enzymes with a solved 
3-D structure have been assigned to a subfamily, result- 
ing in thirteen individual subfamilies out of 51 with at 
least one structural representative. Genes that encode 
GH5 enzymes are present in most organisms ranging 
from Archaea and Eubacteria to Eukaryotes, e.g. fungi 







Figure 1 Phylogenetic tree of family GH5. In this circular phylogram, the branches corresponding to subfamilies 1-53 are shown in color and 
the subfamily numbers are indicated next to the exterior color circle. The branches corresponding to sequences not included into subfamilies are 
in black. A detailed version of this tree is found in Additional file 1: Figure S1.
Table 1 Newly defined subfamilies within glycoside hydrolase family GH5



New GH5 
subfamily 



Historical 
subfamily 



Number of 
sequences 



Present taxonomical 
distribution 



EC 
number 



Representative PDB 
structure 



GH5_1 



GH5_2



GH5_4



GH5_5
GH5_7



GH5_8 
GH5_9 



GH5_10
GH5_11
GH5_12

GH5_13
GH5_14
GH5_15
GH5_16
GH5_17
GH5_18
GH5_19
GH5_20 
GH5_21 
GH5_22 
GH5_23 

GH5_24 
GH5_25 

GH5_26 

GH5_27 
GH5_28 
GH5_29 
GH5_30 



A1 



A2 



A3 + A4 b 



A5 + A6 b,c,e



A7 d



A8 d
A9 e 



A10 e ' 



133 



245 



160 



123 
133 



71 
107 



19 
19 

42 

59 
15 
10 
10 
5 

24 
23 
17 
10 
12 
5 

5 

16 
17 



Archaea Bacteria Eukaryota



Bacteria Eukaryota 



Bacteria Eukaryota 



Bacteria Eukaryota 
Archaea Bacteria Eukaryota 



Bacteria Eukaryota 
Eukaryota 



Bacteria Eukaryota (Metazoa) 
Eukaryota (Plants;Fungi) 
Bacteria Eukaryota 

Bacteria 
Eukaryota (Plants) 
Eukaryota (Fungi) 
Eukaryota (Fungi) 

Bacteria 

Bacteria 

Archaea 
Eukaryota (Stramenopiles) 

Bacteria 
Bacteria Eukaryota 
Eukaryota (Fungi) 

Eukaryota (Fungi) 
Bacteria 

Bacteria 

Eukaryota (Metazoa;Fungi) 
Bacteria 
Bacteria 
Eukaryota (Fungi) 



3.2.1.4
3.2.1.73 
3.2.1.91 

3.2.1.4 
3.2.1.132 

3.2.1.4 
3.2.1.151 
3.2.1.73 

3.2.1.8 

3.2.1.4 
3.2.1.25 
3.2.1.78

2.4.1.- 
3.2.1.78 
3.2.1.58 
3.2.1.75 
3.2.1.21 
3.2.1.78 

ND 
3.2.1.21 
3.2.1.45 

ND 
3.2.1.58 
3.2.1.75 
3.2.1.164 
3.2.1.78 
ND 
ND 
ND 

3.2.1.8 

3.2.1.4 
3.2.1.149 
3.2.1.168 
ND 

3.2.1.4 
3.2.1.78 

3.2.1.4 
3.2.1.73 
3.2.1.123 
3.2.1.123 
3.2.1.123 
ND 



2ZUN 



2A3H 



2JEQ 



1GZJ
1RH9



2WHL 
3N9K 



2C0H 



3MMW 



2OSX
Table 1 Newly defined subfamilies within glycoside hydrolase family GH5 (Continued)

New GH5 subfamily | Number of sequences | Present taxonomical distribution | EC number | Representative PDB structure
GH5_31 | 5 | Eukaryota (Fungi) | 3.2.1.- g |
GH5_32 | 5 | Eukaryota (Plants; Stramenopiles) | ND |
GH5_33 | 9 | Eukaryota (Stramenopiles) | ND |
GH5_34 | 5 | Bacteria | 3.2.1.- g | 2Y8K
GH5_35 | 5 | Bacteria | ND |
GH5_36 | 23 | Bacteria | 3.2.1.73; 3.2.1.78 | 1VJZ
GH5_37 | 19 | Bacteria | 3.2.1.4; 3.2.1.73; 3.2.1.74 | 1CEN
GH5_38 | 10 | Bacteria | 3.2.1.- h |
GH5_39 | 7 | Bacteria | 3.2.1.4 |
GH5_40 | 8 | Bacteria | ND |
GH5_41 | 14 | Bacteria | ND |
GH5_42 | 9 | Bacteria | ND |
GH5_43 | 11 | Bacteria | ND |
GH5_44 | 24 | Bacteria | ND |
GH5_45 | 5 | Bacteria | ND |
GH5_46 | 15 | Bacteria | 3.2.1.- h |
GH5_47 | 6 | Bacteria | ND |
GH5_48 | 18 | Bacteria | 3.2.1.- h |
GH5_49 | 20 | Eukaryota (Fungi) | ND |
GH5_50 | 7 | Eukaryota (Fungi) | ND |
GH5_51 | 5 | Bacteria Eukaryota (Fungi) | ND |
GH5_52 | 6 | Bacteria | 3.2.1.74 |
GH5_53 | 6 | Bacteria | 3.2.1.74 |


ND Activity not determined yet. 

a Experimentally determined.

b see [16]. 

c see [17]. 

d see [18]. 

e see [19]. 

f see [20]. 

g EC numbers not yet defined. 

h Active enzyme(s) present but with unclear EC number(s). 
* See www.cazy.org for most recent information. 

Historical names for some subfamilies are provided, along with the taxonomical range, the characterization level and structural information from representative
structures. Known enzyme activities in family GH5 are provided using the following Enzyme Commission (EC) numbers and corresponding activities: 3.2.1.4 -
endo-β-1,4-glucanase or cellulase; 3.2.1.8 - endo-β-1,4-xylanase; 3.2.1.21 - β-glucosidase; 3.2.1.25 - β-mannosidase; 3.2.1.45 - β-glucocerebrosidase; 3.2.1.58 -
glucan β-1,3-glucosidase; 3.2.1.73 - licheninase; 3.2.1.74 - cellodextrinase; 3.2.1.75 - glucan endo-β-1,6-glucosidase; 3.2.1.78 - mannan endo-β-1,4-mannosidase or
endo-β-1,4-mannanase; 3.2.1.91 - cellulose β-1,4-cellobiosidase or cellobiohydrolase; 3.2.1.123 - endoglycoceramidase; 3.2.1.132 - chitosanase; 3.2.1.149 -
β-primeverosidase; 3.2.1.151 - xyloglucan-specific endo-β-1,4-glucanase; 3.2.1.164 - endo-β-1,6-galactanase; 3.2.1.168 - hesperidin 6-O-α-L-rhamnosyl-β-glucosidase;
3.2.1.- - undefined EC numbers for β-1,3-mannanase/β-1,3-glucomannanase or for arabinoxylan-specific β-xylanase or still unclear EC numbers depending on the
subfamily (see notes and text); 2.4.1.- - β-mannan transglycosidase.



and plants. From an anthropocentric perspective, a GH5 
member is notably lacking in the human genome. Exam- 
ples of metazoan GH5 genes are also notably scarce and 
are limited to nematodes, mollusks, and arthropods, 
likely resulting from horizontal transfer. For example, 
several independent horizontal transfers of cellulase
and xylanase genes from Bacteria to nematodes
have been described [27] and the transfer of a bacterial
β-mannanase to an insect was recently documented
[28]. The taxonomical range at the subfamily level is, nat- 
urally, more restricted. A few smaller subfamilies are cur- 
rently specific to certain types of organisms (Table 1). For 
example, eight subfamilies (GH5_15, GH5_16, GH5_23, 
GH5_24, GH5_30, GH5_31, GH5_49 and GH5_50) con- 
tain only fungal sequences. Subfamily GH5_14 contains 
only plant members, whereas members of GH5_20 

Figure 2 Examples of modular GH5 proteins. (a) Diverse modular arrangements of putative monofunctional modular enzymes from subfamily
GH5_8. (b) Same for putative bifunctional GH5 enzymes containing a subfamily GH5_8 module. (c) Other putative bifunctional enzymes
containing at least a single GH5 module. (d) Selected examples of proteins containing GH5 modules having lost one or more catalytic residues.
For a given protein, each GH5 module is identified by a number of fields separated by "|" indicating: (i) the organism, with 3 letters for the genus
and either 5 letters for the species or full strain code; (ii) the GenBank protein accession; (iii) if attributed, the subfamily number or other 
information; (iv) EC numbers if available. These individual tags are analogous to what is found in Additional file 1: Figure S1. The module types 
and other protein segments present are: GHx_y - glycoside hydrolase family x subfamily y (pink); CEx - carbohydrate esterase module of family x 
(light brown); Cip21 - chitin-binding protein type 21 module with putative carbohydrate oxidative cleaving activity, formerly CBM33 (dark gray); 
CBMx - carbohydrate binding modules of family x (light green); FN3 - fibronectin type III modules (dark green); DOC - cellulosomal dockerin 
modules (light violet); EXPN - expansin modules (dark purple); signal peptides (purple); transmembrane segments (yellow); linkers (light blue); 
other regions (light grey). 
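
The tag format described in this legend can be read programmatically. The following is a minimal sketch, assuming that the field separator is a vertical bar and that multiple EC numbers within a tag are joined by "+"; the `ModuleTag` type, the `parse_tag` function and the example tag are illustrative and are not taken from any published tool.

```python
# Minimal sketch of a parser for the module tags described in the Figure 2 legend.
# Assumptions (not confirmed by the source): fields are separated by "|" and
# multiple EC numbers are joined by "+". ModuleTag and parse_tag are hypothetical names.
from typing import NamedTuple, Optional, Tuple


class ModuleTag(NamedTuple):
    organism: str               # 3 letters for the genus plus species or strain code
    accession: str              # GenBank protein accession
    subfamily: Optional[str]    # subfamily number or other information, if attributed
    ec_numbers: Tuple[str, ...]  # EC numbers, if available


def parse_tag(tag: str, sep: str = "|") -> ModuleTag:
    """Split a tag into organism, accession, optional subfamily and EC numbers."""
    fields = [f.strip() for f in tag.split(sep)]
    if len(fields) < 2:
        raise ValueError(f"expected at least organism and accession in {tag!r}")
    subfamily = fields[2] if len(fields) > 2 and fields[2] else None
    ecs = tuple(fields[3].split("+")) if len(fields) > 3 and fields[3] else ()
    return ModuleTag(fields[0], fields[1], subfamily, ecs)


if __name__ == "__main__":
    # Illustrative tag written in the format described by the legend.
    print(parse_tag("Bac_circu|BAA25878|8|3.2.1.78"))
```

Tags without a subfamily or EC field are returned with `None` and an empty tuple, respectively, so that unclassified modules can be handled alongside classified ones.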



come exclusively from Stramenopiles. These limited taxo- 
nomic distributions may represent biological (catalytic) 
specialization, if not biased by a still incomplete genome 
sequencing of organisms. 

The definition of subfamilies was restricted to phylo- 
genetic clades with five or more members from different 
organisms available in the public protein databases 
(GenBank and UniProt) in order to capture sufficient diversity
for a robust subfamily definition. Using these
criteria, the overall success rate of subfamily grouping
was approximately 80%, i.e., about 20% of the analyzed
GH5 sequences could not be classified into subfamilies
having at least five public members. The
sequences that have not yet been assigned to subfamilies 
will likely define new subfamilies as the pool of available 
sequences continues to increase. In the future, these 
subfamilies will be gradually released with new identi- 
fiers when they have been sufficiently populated. 
Compared to the GH13 subfamily classification [24], the
GH5 subdivision has resulted in both a higher number 
of subfamilies and a higher number of uncharacterized 
subfamilies, suggesting that family GH5 is comparatively 
less well explored. 
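
The membership criterion can be illustrated with a short sketch. It assumes that clade membership has already been derived from a phylogenetic tree, and it interprets "five or more members from different organisms" as at least five distinct source organisms per clade; both the interpretation and the names `assign_subfamilies` and `coverage` are assumptions made for this example rather than the procedure actually used.

```python
# Hedged sketch of the subfamily acceptance rule: a clade becomes a subfamily only
# if it holds members from at least MIN_MEMBERS distinct organisms (an interpretation
# of the criterion in the text). All names and the toy data are hypothetical.

MIN_MEMBERS = 5  # threshold stated in the text


def assign_subfamilies(clade_members):
    """clade_members maps a clade id to a list of (accession, organism) pairs.

    Returns (accepted, unassigned): accepted maps clade id to its accessions,
    unassigned lists accessions from clades that fail the threshold.
    """
    accepted, unassigned = {}, []
    for clade, members in clade_members.items():
        organisms = {organism for _, organism in members}
        if len(organisms) >= MIN_MEMBERS:
            accepted[clade] = [acc for acc, _ in members]
        else:
            unassigned.extend(acc for acc, _ in members)
    return accepted, unassigned


def coverage(accepted, unassigned):
    """Fraction of sequences that were placed in an accepted subfamily."""
    assigned = sum(len(accs) for accs in accepted.values())
    total = assigned + len(unassigned)
    return assigned / total if total else 0.0


if __name__ == "__main__":
    toy = {
        "clade_a": [(f"P{i}", f"organism_{i}") for i in range(5)],
        "clade_b": [("Q1", "organism_1"), ("Q2", "organism_1"), ("Q3", "organism_2")],
    }
    accepted, unassigned = assign_subfamilies(toy)
    print(sorted(accepted), unassigned, coverage(accepted, unassigned))
```

With real data, the value returned by `coverage` corresponds to the success rate of subfamily grouping discussed above.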

To further refine the global subfamily analysis, maximum
likelihood phylogenetic analysis has been performed
on each subfamily (Additional file 1: Figure S1:
GH5_1 - GH5_53). In addition to database accession
numbers, information about substrate specificity (as pro- 
vided by EC numbers) has been included, and 3-D 
structures highlighted in each subfamily tree. The 51 sub- 
families (numbered GH5_1 to GH5_53 as explained else- 
where) have been categorized based on available enzyme 
activity data, to aid their individual descriptions, below. 
Thus, the first group, comprised of "Enzymatically-Active
Subfamilies", contains subfamilies with at least one member
whose enzyme activity has been shown. The extent of
the documented characterization varies substantially 
within the group, from simple information obtained from 
enzymatic assays insufficient to assign a particular EC 
number (these enzymes are denoted with a star in 
Additional file 1: Figure S1), to detailed enzyme specificity
and kinetics studies. Among the better characterized sub- 
families, monospecific subfamilies are distinguished by the 
presence of only one EC number for one or more mem- 
bers, whereas in polyspecific subfamilies, two or more en- 
zymatic activities have been observed in different 
members. "Uncharacterized Subfamilies" comprise the 
second major group; these subfamilies currently lack 
documented enzymatic activity altogether. 
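
These categories amount to a simple decision rule on the set of EC numbers recorded for a subfamily. The sketch below is one possible reading of the definitions given above; the function name `classify_subfamily` and the returned labels are illustrative, and incomplete EC numbers such as "3.2.1.-" are treated here as evidence of activity without a fully assigned EC number.

```python
# Sketch of the categorization described in the text, assuming each subfamily is
# summarized by the EC numbers reported for its characterized members.
# classify_subfamily and its labels are hypothetical names for this illustration.


def classify_subfamily(ec_numbers):
    """Classify a subfamily from the EC numbers of its characterized members.

    An empty collection (ND in Table 1) means no documented activity.
    EC numbers ending in "-" are incomplete: activity was shown, but no
    complete EC number could be assigned.
    """
    ecs = set(ec_numbers)
    if not ecs:
        return "uncharacterized"
    complete = {ec for ec in ecs if not ec.endswith("-")}
    if not complete:
        return "active, EC number unassigned"
    if len(complete) == 1:
        return "monospecific"
    return "polyspecific"


if __name__ == "__main__":
    # EC entries taken from Table 1 for three subfamilies.
    print(classify_subfamily(["3.2.1.74"]))              # GH5_52
    print(classify_subfamily(["3.2.1.73", "3.2.1.78"]))  # GH5_36
    print(classify_subfamily([]))                        # GH5_40 (ND)
```

Because the rule depends only on the activities tested so far, a subfamily labelled monospecific by such a function may be relabelled as more members are characterized.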

In this context, it is worth noting that the majority of 
the large, well characterized subfamilies were polyspeci- 
fic. These highly populated subfamilies were also the 
first ones to be identified and described, and often, des- 
pite observed polyspecificity, one particular activity pre- 
dominates. For example, 22 of 24 characterized enzymes 
in GH5_1 are reported to be endo-glucanases, whereas
one protein is documented as a cellobiohydrolase, and
another was described as displaying licheninase activity. 
It is, however, difficult to draw far-reaching conclusions 
based on these observations. On one hand, it may be 
that the acquisition of a new specificity within a subfam- 
ily is a rare event; alternatively, the observation of one or 
few activities in specific subfamilies may simply be a 
consequence of differences in the range of substrates 
tested experimentally. 

Another significant aspect of many GH5 family proteins
is that their protein sequence may include additional
modules with different functions, in particular
CBMs [29]. A great variety of modular structures
may be found throughout the family and in a number
of individual subfamilies. An analysis of the full
complexity of modular structures found in the family
goes beyond the objectives of this study, but some
aspects of this diversity are illustrated in Figure 2. For
instance, many members of subfamily GH5_8 are modular
and reveal two major trends: (i) the addition of one
or multiple CBMs (see Figure 2a) is the more common
arrangement and may be associated with the nature of the
main substrate of the corresponding catalytic domain,
particularly for complex substrates; and (ii) the combination
with other catalytic modules to form bifunctional
enzymes (see Figure 2b and also 2c) is rarer but
particularly useful for revealing interacting or synergistic
enzyme activities of some catalytic modules. Numerous
modular arrangements can also be found in other large 
subfamilies like GH5_1, GH5_2 and GH5_7. CBMs can 
be located on the N- or C-terminal side of the GH5 
module (as illustrated in GH5_8 in Figure 2a). The com- 
bination of catalytic domains may target different tissue 
components. Some may, for instance, target cellulose 
and cellulose associated substrates (as in Figure 2b) but 
bifunctional enzymes likely targeting hemicellulose may 
also be found (several examples in Figure 2c). 

Subfamilies with identified active enzymes 
Monospecific subfamilies 

A number of subfamilies exhibit a single activity among 
their characterized enzyme members. Multiple individual 
examples within a subfamily improve the degree of con- 
fidence regarding subfamily monospecificity, while sub- 
families with only a single characterized representative 
may be subject to reinterpretation in the future as the 
breadth of biochemical data increases. 

GH5_5: endo-β-1,4-glucanases (EC 3.2.1.4) The largest
subfamily that contains only a single EC number is
GH5_5, which is primarily composed of secreted bacterial
and fungal enzymes (Additional file 1: Figure S1:
GH5_5). All investigated enzymes in GH5_5 display
endo-β-1,4-glucanase activity (EC 3.2.1.4). One crystal
structure has been determined for a Thermoascus aurantiacus
endo-glucanase [19]. In fungi, about half of the
GH5 proteins harbor a CBM1 module at the N- or C- 
terminus, which is compatible with an active role on cel- 
lulose. No modular proteins are found among the bac- 
terial members of the subfamily. 

GH5_8, GH5_10 and GH5_17: endo-β-1,4-mannanase
(EC 3.2.1.78) Members of the large GH5_8 subfamily
are all extracellular mannan endo-β-1,4-mannosidases
(EC 3.2.1.78) according to available biochemical characterization,
and this subfamily was historically described
as the bacterial mannanase subfamily A8 [18]. Structural
analysis has highlighted distinctive features of alkaline
β-mannanases [30]. The subfamily now contains a single
eukaryotic enzyme from the beetle Hypothenemus
hampei, resulting from a horizontal gene transfer from
bacteria [28].

In the closely related but distinct subfamilies GH5_10 
and GH5_17, only extracellular enzymes with
endo-β-1,4-mannanase activity (EC 3.2.1.78) have likewise been
reported. Metazoan sequences, including sequences ori- 
ginating from mollusks and arthropods, as well as bac- 
terial sequences compose the GH5_10 subfamily, 
whereas subfamily GH5_17 only harbors bacterial 
enzymes. Currently, none of the bacterial GH5_10 mem- 
bers have a documented enzyme activity, although a 
Fibrobacter succinogenes enzyme is active on AZCL- 
galactomannan [31]. The bacterial enzymes in GH5_17 
are all cellulosomal components from the genus 
Clostridium. 

GH5_14: (plant) exo-β-1,3-glucosidase (EC 3.2.1.58)

Subfamily GH5_14 comprises exclusively plant enzymes
and relies on a single functional characterization at
present. Recombinant expression of the rice OsGH5BG
showed that the enzyme had glucan β-1,3-glucosidase
activity (EC 3.2.1.58) [22]. A unique feature of GH5_14
is the fascin-like module inserted after β-strand 1. Fascin
is a human actin-binding protein, but the function of the 
plant fascin-like domain is unknown. Members of 
GH5_14 are well represented throughout the plant king- 
dom, but most interestingly, a representative is absent in 
the leading plant model organism Arabidopsis thaliana. 

GH5_15: (fungal) (endo-)β-1,6-glucanases (EC 3.2.1.75)

Subfamily GH5_15 is a small but well characterized subfamily
composed of secreted fungal enzymes. The identified
β-1,6-glucanase activity (EC 3.2.1.75) is important
for the mycoparasitic activity and probably cell wall recycling
by some fungi. The GH5_15 phylogenetic tree
displays two major clades (Additional file 1: Figure S1:
GH5_15). The largest clade is formed by enzymes
from fungi of the class Sordariomycetes. Eurotiomycetes
are found in the second subgroup.

GH5_16: (fungal) endo-β-1,6-galactanase (EC 3.2.1.164)

GH5_16 is another example of a monospecific subfamily of
secreted enzymes where only a single fully sequenced, biochemically
characterized β-1,6-galactanase (EC 3.2.1.164) is
currently known [32], although an Aspergillus enzyme whose
partial N-terminal sequence is closely related to sequences of
other members of the subfamily displays the same activity
[33]. Several other enzymes with β-1,6-galactanase activity
have been moved from GH5 to GH30_5 [23], but subfamily
GH5_16 clearly remains within family GH5. The known
β-1,6-galactanase (EC 3.2.1.164) is involved in larch wood
arabinogalactan degradation [32].



GH5_21: (Bacteroidetes) endo-β-1,4-xylanase (EC
3.2.1.8) Xylanase (EC 3.2.1.8) activity has recently been
established [34] for a number of xylanolytic Bacteroidetes
enzymes belonging to this subfamily. These
GH5_21 endo-xylanases are encoded within xylan utilization gene
clusters found in Prevotella and Bacteroides species and
are all apparently secreted. Significant differences in 
their mode of action have been observed, despite the in- 
clusion in the same subfamily. Different GH5_21 
enzymes were shown to release different products from 
wheat arabinoxylan [34]. 

GH5_22, GH5_31, GH5_34, GH5_39, and GH5_53:
single β-glycanase characterizations In addition to
subfamilies GH5_14 and GH5_16 described above, five
other subfamilies are distinguished by harboring only a
single experimentally characterized enzyme. Endo-β-1,4-glucanase
activity (EC 3.2.1.4) has been determined in
subfamilies GH5_22 and GH5_39. GH5_31 is a small
subfamily currently restricted to secreted fungal proteins.
A β-1,3-(gluco)mannanase activity (EC 3.2.1.-) a
has been reported for an enzyme from Paecilomyces lilacinus
[35]. In the small subfamily GH5_34, composed of
extracellular modular proteins of bacterial origin,
there is a single enzymatically and structurally characterized
enzyme from Clostridium thermocellum. Notably,
although this is the first reported enzyme with arabinoxylanase
activity (EC 3.2.1.-) b, it was designated CtXyl5A
in spite of its inability to attack unsubstituted xylans
[36]. Finally, the small subfamily GH5_53 is a modular
extracellular subfamily that contains a single characterized
cellodextrinase (EC 3.2.1.74) [37].

GH5_27 GH5_28 and GH5_29 : ewdo-glycosylcerami- 
dases (EC 3.2.1.123) GH5_27, GH5_28 and GH5_29 
are subfamilies exclusively containing extracellular endo- 
glycosylceramidases. Subfamily GH5_27 is formed of 
sequences of eukaryotic origin while the small subfam- 
ilies GH5_28 and GH5_29 are bacterial. Interestingly, 
the first subfamily is found among the four subfamilies 
that contain metazoan GH5 enzymes. All but one of the 
enzymes found in GH5_28 are from Actinobacteria. The 
crystal structure of a Rhodococcus endo-glycoceramidase 
revealed an active site channel atypical for GH5 
enzymes, which explains the unusual substrate for this 
type of enzymes [38]. All the GH5_29 sequences 
reported here are from the genus Rhodococcus. The 
characterized bacterial enzymes of GH5_28 hydrolyze 
ganglio- and lacto-series glycosphingolipids. In contrast, 
the only GH5_29 enzyme investigated is not capable of 
hydrolyzing these substrates. Instead, this enzyme shows 
activity against 6-gala series glycosphingolipids and 
the designation oligogalactosyl-N-acylsphingosine 
1,1'-β-galactohydrolase has been proposed [39]. 

GH5_52: cellodextrinases (EC 3.2.1.74) Two enzymes 
in the small intracellular bacterial GH5_52 subfamily 
exhibited cellodextrinase activity (EC 3.2.1.74). In 
addition, two enzymes isolated from the cow rumen have 
been shown to have hydrolytic activity on carboxymethyl 
cellulose (CMC) agar [14]. 

Polyspecific subfamilies 

Some GH5 subfamilies group together a panel of activ- 
ities and are described as polyspecific subfamilies. The 
apparent plasticity of these subfamilies suggests that 
only a few subtle changes could be sufficient to switch 
from one activity to another. More likely, many of the 
enzymes grouped at the subfamily level are polyspecific 
to some extent and should therefore be assigned more than a 
single EC number. 

GH5_1 and GH5_2: β-1,4-glucan cleaving enzymes 

Extracellular enzymes from archaea, bacteria and uncul- 
tured symbiotic protists are represented in subfamily 
GH5_1 (Additional file 1: Figure S1: GH5_1). The activity 
observed for most characterized GH5_1 enzymes is 
endo-β-1,4-glucanase activity (EC 3.2.1.4). The apparent 
exception in subfamily GH5_1 is an exo-acting cellobiohydrolase 
activity (EC 3.2.1.91) from Clostridium thermocellum 
[40]. However, one should note that the 
biochemical distinction of exo- versus endo-acting cellulases 
is particularly difficult to establish experimentally. 
Interestingly, an enzyme from Ruminococcus albus able 
to cleave CMC and glucomannan but particularly active 
on lichenin was recently described [41]. Many proteins 
in this subfamily are modular (data not shown), a feature 
shared with many members of subfamily GH5_8 as 
described previously. 

Subfamily GH5_2 is currently the largest in family 
GH5. This subfamily of extracellular enzymes, many of 
which are multimodular, contains a large number of 
characterized members that display endo-β-1,4-glucanase 
activity (EC 3.2.1.4). These endo-glucanases are distributed 
across the subfamily tree and are found in every 
major clade (Additional file 1: Figure S1: GH5_2). One 
endo-glucanase from Fibrobacter succinogenes S85 in 
this subfamily was reported to be active both on CMC 
and oat spelt xylan [42], but xylanase activity has not 
been observed in other members thus far. Interestingly, 
one representative of this subfamily has been reported as 
a chitosanase (EC 3.2.1.132) with transglycosylation ac- 
tivity [43]. A bifunctional cellulase/chitosanase has also 
been identified in Bacillus sp. NBL420 [44] in a different 
clade. Significantly, a closely related N-terminal sequence 
of a bifunctional cellulase/chitosanase from Myxobacter 
sp. AL-1 [45] suggests that a specific subgroup 
bearing both activities may exist. As for subfamily 
GH5_1, many members of this subfamily are multimodular, 
having both CBMs and cellulosomal-like dockerins. 

GH5_4: (xyloglucan-specific) endo-β-1,4-glucanases 
(EC 3.2.1.4 and EC 3.2.1.151), licheninases (EC 
3.2.1.73), and xylanases (EC 3.2.1.8) Subfamily GH5_4 
members are typically extracellular bacterial enzymes, al- 
though some members come from ciliates and fungi, 
predominantly rumen organisms. In total, four enzyme 
activities have been reported for subfamily GH5_4. Thus 
far, GH5_4 is the only subfamily containing enzymes 
with reported xyloglucanase activity (EC 3.2.1.151) [46]. 
Interestingly, the xyloglucanases are found in two different 
clades of the GH5_4 subfamily tree, suggesting that 
the switch of enzyme activity inside this subfamily occurred 
at different times (Additional file 1: Figure S1: 
GH5_4). Licheninases (EC 3.2.1.73) have also been 
described in this subfamily. Significantly, endo-β-1,4-xylanase 
activities (EC 3.2.1.8) have been reported for a 
few enzymes, but always in conjunction with other activities. 
For example, the specific activities of Clostridium 
cellulovorans EngB and EngD on xylan are both 
approximately 14% of their respective specific activities 
on lichenan [47]. Such features suggest an important degree 
of enzyme promiscuity given the structural similarity 
of the β-linked substrates. However, the most 
commonly reported EC number for representatives of 
GH5_4 is EC 3.2.1.4. Except for a few fungal pathogen 
members of the subfamily that bear a CBM1 module, no 
other known CBM is present, in sharp contrast to what 
has been found for extracellular subfamilies GH5_1 and 
GH5_2. 

GH5_7: β-1,4-mannan-cleaving enzymes (EC 3.2.1.78 
and EC 3.2.1.25) In subfamily GH5_7, previously 
named A7 (originally comprised of only eukaryotic man- 
nanases but now also containing archaeal and bacterial 
members) [18], the three reported enzyme activities are 
associated with the degradation or the modification of 
β-mannan-containing polysaccharides. Virtually all examined 
GH5_7 enzymes possess endo-β-1,4-mannanase 
activity (EC 3.2.1.78). One exception is the tomato protein 
LeMan4a, which, in addition to hydrolytic activity, 
can act in vitro as a mannan transglycosylase (EC 2.4.1.-) 
[48]. Recently, mannan transglycosylase activity was also 
reported for two fungal GH5_7 enzymes [49]. It is 
possible that further GH5_7 enzymes will reveal 
transglycosylase activity in the future, since the distinction 
between hydrolytic and transglycosylase activity is 
dictated by the tendency of the glycosyl-enzyme intermediate 
to be intercepted by a water molecule or a 
saccharide molecule, respectively. Another interesting 
case among GH5_7 members is the β-mannosidase 
CmMan5A, which is able to release mannose from the 
non-reducing end of manno-oligosaccharides and -polysaccharides 
(EC 3.2.1.25) and is thus exo- and not endo-acting. 
This difference was explained by the length of 
three loops which modify the active center accessibility 
[50]. Furthermore, this subfamily has both extracellular 
and intracellular enzymes. Many GH5_7 extracellular 
enzymes from fungi contain a CBM. In many bacteria, 
CBMs from different families are also found appended 
to GH5_7 catalytic modules (data not shown). 

GH5_9: fungal cell wall modifying enzymes This subfamily 
contains only sequences of fungal origin putatively 
found in different cell locations: some are 
secreted, several present a GPI-anchor, others have sin- 
gle transmembrane segments and yet others appear to 
be intracellular. Exo-β-1,3-glucanase activity (EC 
3.2.1.58) has been described for most of the characterized 
examples, all found among the apparently secreted 
enzymes. Structural investigation of the Candida albicans 
Exg protein provides a clue to the evolution of 
these exo-hydrolases; the structure reveals an active site 
pocket shaped for cleavage of β-1,3- but not β-1,4-glycoside 
linkages [51]. Surprisingly, the periplasmic Exg1 
protein from Schizosaccharomyces pombe was demonstrated 
to be an endo-β-1,6-glucanase (EC 3.2.1.75), 
while the membrane-anchored Exg2 protein was 
shown to produce cell wall material when over-expressed 
[52]. Two of the characterized GH5_9 exo-β-1,3-glucanases 
have also been described as β-glucosidases 
(EC 3.2.1.21) [53]. 

GH5_12: β-glucosylceramidases (EC 3.2.1.45) and 
(flavonoid) β-glucosidase (EC 3.2.1.21) Previously 
known to contain a flavonoid β-glucosidase (EC 3.2.1.21) 
from yeast [53], the panorama of specificities in this subfamily 
recently expanded to include several fungal 
β-glucosylceramidases (EC 3.2.1.45) [54]. The subfamily is 
grouped into a fungal clade, including three Cryptococcus 
sequences (one now characterized), and a clade 
dominated by bacterial enzymes (Additional file 1: 
Figure S1: GH5_12). 

GH5_23: fungal β-diglycosidases (EC 3.2.1.149 and 
EC 3.2.1.168) Subfamily GH5_23 is composed of secreted 
fungal proteins, two of which have been characterized as 
β-diglycosidases that break down plant diglycoconjugated 
flavonoids. Deglycosylation of these compounds most often 
involves the sequential action of two β-glycosidases, in 
contrast to the one-step hydrolytic release of the disaccharide 
moiety from the aglycone by β-diglycosidases [55]. The 
characterized enzymes are a hesperidin 6-O-α-L-rhamnosyl-β-glucosidase 
(EC 3.2.1.168) from Stilbella fimetaria, 
for which a partial sequence has been obtained [56], and a 
reported β-primeverosidase (EC 3.2.1.149) from Penicillium 
multicolor TS-5 [57]. This subfamily is present in the genera 
Aspergillus and Penicillium, which are known for their 
interaction with plants. 

GH5_25: endo-β-1,4-glycanases (EC 3.2.1.4 and EC 
3.2.1.78) Most enzymes found in subfamily GH5_25 are 
derived from thermophiles. Interestingly, characterized 
enzymes in this subfamily represent examples of GH5 
enzymes with multiple activities. For instance, Cel5A 
from Thermotoga maritima exhibits activity on both 
β-mannan-based and β-glucan-based polymers. Analyses 
of the TmCel5A structure have highlighted features important 
both for the nature of the duality and the thermostability 
[58,59]. 

GH5_26: endo-β-1,4-glycanases (EC 3.2.1.4 and EC 
3.2.1.73) GH5_26 is a small subfamily with a majority of 
sequences from uncultured microorganisms. The dominant 
activity found is endo-β-1,4-glucanase (EC 
3.2.1.4), but one enzyme also has high activity against 
lichenan (EC 3.2.1.73) [60]. 

GH5_36: endo-β-1,4-glycanases (EC 3.2.1.73 and EC 
3.2.1.78) Several bacterial phyla are represented in subfamily 
GH5_36. One enzyme has a demonstrated 
endo-β-1,4-mannanase activity (EC 3.2.1.78), whereas a second 
enzyme exhibits licheninase activity (EC 3.2.1.73) in 
addition to β-mannanase activity [61,62]; a 3-D structure 
is available for the latter enzyme. 

GH5_37: endo-β-1,3/4-glycanases (EC 3.2.1.4 and EC 
3.2.1.73) + cellodextrinase (EC 3.2.1.74) Three different 
activities are found in subfamily GH5_37, which 
consists of sequences of bacterial origin encoding intra- 
cellular proteins. The majority of the characterized 
enzymes are cellulases (EC 3.2.1.4), but there are also 
examples of licheninase activity (EC 3.2.1.73) [41], and 
cellodextrinase activity (EC 3.2.1.74) [63]. 

GH5_38 and GH5_46: bacterial enzymes active on 
model plant cell wall (PCW) compounds In subfamily 
GH5_38, three enzymes isolated from rumen metage- 
nomic projects have been partially characterized [12,14]. 
To these, we may add a Prevotella ruminicola 23 enzyme 
described as cellulase (PRU_1856) and shown to be ac- 
tive against CMC, Avicel, and lichenan [64]. Another en- 
zyme discovered in the microbial community of the cow 
rumen is active on CMC and belongs to subfamily 
GH5_46 [14]. This is the only evidence of activity in 
this subfamily. 

GH5_48: bacterial enzymes active on chitin and chitosan 
derivatives Several bacterial phyla are currently 
represented in subfamily GH5_48, mostly composed of 
Table 2 Characterized carbohydrate-active enzymes of family GH5 not yet classified into subfamilies 

| Description | EC | Assay | Organism | Accessions | Modular structure | Taxonomic class |
|---|---|---|---|---|---|---|
| β-mannanase A (ManA; CelA) | 3.2.1.78 | | Caldanaerobius polysaccharolyticus KM-THCJ | AAD09354 | GH5 CBM16 CBM16 | B-Firmicutes_Clostridia |
| endo-β-1,4-glucanase D (CelD; CelCCD; EGCCD; Ccel_0840) | 3.2.1.4 | | Clostridium cellulolyticum H10 [B] | BAA14354, ACL75216 | GH5 CBM11 DOC | B-Firmicutes_Clostridia |
| endo-β-1,4-glucanase/β-1,3:1,4-glucanase H (CelH) | 3.2.1.4, 3.2.1.73 | | Clostridium thermocellum NCIB 10682 | AAA23225 | GH26 GH5 CBM11 DOC | B-Firmicutes_Clostridia |
| cellulase (EBI-244) | 3.2.1.4 | | Desulfurococcaceae archaeon EBI-244 | AEB53062 | GH5 | A-Crenarchaeota |
| endo-β-1,4-glucanase 3 (Cel3; Cel-3; Eg3; Fisuc_2230; FSU_2772) | 3.2.1.4 | | Fibrobacter succinogenes subsp. succinogenes S85 | AAA24893, ACX75816, ADL25000 | GH5 | B-Fibrobacteres_Acidobacteria group |
| Fisuc_2933/FSU_0196 | | AGM | Fibrobacter succinogenes subsp. succinogenes S85 | ACX76513, ADL26912 | GH5 CBM4 | B-Fibrobacteres_Acidobacteria group |
| Fisuc_1523/FSU_2005 | | MUC | Fibrobacter succinogenes subsp. succinogenes S85 | ACX75120, ADL26743 | GH5 | B-Fibrobacteres_Acidobacteria group |
| endoglucanase (CelA; lpg1918) | 3.2.1.4 | AHEC | Legionella pneumophila subsp. pneumophila str. Philadelphia 1 | AAU27988 | GH5 | B-Gammaproteobacteria |
| endo-β-1,4-glucanase 5B (Sde_2490) | 3.2.1.4 | | Saccharophagus degradans 2-40 | ABD81750 | CBM6 GH5 | B-Gammaproteobacteria |
| endo-β-1,4-glucanase 5E (Sde_2929) | 3.2.1.4 | | Saccharophagus degradans 2-40 | ABD82186 | CBM6 CBM6 GH5 | B-Gammaproteobacteria |
| endo-β-1,6-glucanase (Exg3; SPBC2D10.05) | 3.2.1.75 | | Schizosaccharomyces pombe 972h- | CAA21163, NP_596224 | GH5 | E-Fungi |
| endo-β-1,4-glucanase (Cel5G) | 3.2.1.4 | | uncultured bacterium | ADD71777 | GH5 | B-environmental samples |
| [illegible] | | [illegible], CMC | uncultured organism | [illegible] | [illegible] | U-unclassified sequences |
| SARM_0047/1057205_158590/TW-15 | | CMC | uncultured organism | ADX05718 | GH5 | U-unclassified sequences |
| SARM_0086/0_06533^/V-1 8 | | PCW | uncultured organism | ADX05761 | GH5 | U-unclassified sequences |
| β-glucanase (RR.06; RR.06-1; BglC) | | BBG, Cel5, LIC | unidentified microorganism | CAJ19140 | GH5 | U-unclassified sequences |
| β-glucanase (RR.10; RR.10-1) | | BBG | unidentified microorganism | CAJ19146 | | U-unclassified sequences |

The active enzymes are tagged by their Enzyme Classification (EC) number (see Table 1) or by the significant positively assayed substrates that are: AEHC - AZCL-HE cellulose; AGM - AZCL-galactomannan; BBG - barley β-glucan; Cel5 - cellopentaose; CMC - carboxymethyl cellulose; LIC - lichenan; MUC - 4-methylumbelliferyl-β-D-cellobioside; PCW - plant cell wall. 



extracellular and membrane-anchored proteins of un- 
known function. A single member, a partial sequence 
(not shown in trees because incomplete) coding for a 
protein from Pseudomonas putida P3(4), over 98% identical 
to locus PPS_1333 (GenBank accession AEJ11908) 
from Pseudomonas putida S16, has been described as a 
bifunctional enzyme as it was active on preparative 
forms of chitin and colloidal chitosan and on the 
model compounds pNP-β-N-acetylglucosaminide and 
4-methylumbelliferyl-N-acetyl-β-D-glucosaminide [65]. Given 
that chitosanases are already present in family GH5, 
related activities are not unexpected. Interestingly, one 
sequence in GH5_48 seems to lack the catalytic machinery 
(Additional file 1: Figure S1). 

Subfamilies lacking experimental characterization 

A large number of subfamilies still require evidence of 
enzyme activity (Table 1). A large characterization effort 
is still needed to identify the hidden activities of these 
subfamilies. For a number of subfamilies, particularly for 
those containing bacterial enzymes, direct hints of the 
activities to be tested can be obtained from the add- 
itional modules present, such as CBMs, but also from 
operon-like organizations. More indirect approaches, 
such as transcriptomics and functional metagenomics 
using innovative sets of conditions 
and enlarged classes of substrates, are equally able to 
provide clues to enzyme function. The uncharacterized 
subfamilies are distributed across the GH5 phylogenetic 
tree. However, in one of the three major clades, a large 
block of uncharacterized subfamilies (GH5_13, GH5_18, 
GH5_19, GH5_30, GH5_41 and GH5_42) is located between 
characterized subfamilies containing β-mannan-acting 
enzymes (GH5_7, GH5_10, GH5_17, and GH5_31) 
(Figure 1). It is tempting to speculate that at least some 
of these uncharacterized subfamilies target substrates 
related to β-mannan. 

Unclassified GH5 proteins 

Roughly 20% of the analyzed GH5 sequences were not 
assigned to subfamilies, although some of these proteins 
have been characterized to different levels. The main 
reason for the inability to assign them to subfamilies was 
the lack of sufficiently related sequences to define a subfamily 
with at least five members. This is likely to 
change in the future owing to a predictable increase in the 
number of available sequences; these unclassified 
sequences represent a pool out of which many more 
subfamilies will emerge. A total of 17 characterized 
enzymes were identified in this diverse set of non- 
classified sequences (see Table 2). Interestingly, two 
main groups arise: (i) post-genomic enzyme characteri- 
zations, and (ii) enzymes originating from functional 
metagenomics-based discovery and characterization. 
Here, the cytoplasmic endo-β-1,6-glucanase Exg3 from 
the model organism Schizosaccharomyces pombe [52] 
is a representative of the former efforts focused on the 
identification of fundamental activities. On the other 
hand, the archaeal multidomain hyperthermophilic cellulase 
EBI-244 screened for its ability to degrade CMC 
at high temperature represents the latter category of 
efforts [66]. 

Non-catalytic GH5 subfamilies and GH5 modules 

Interestingly, the analysis underlying the subfamily 
classification revealed proteins that are likely to be catalytically 
inactive (or perhaps possess alternate mechanisms 
and activities to the canonical GHs), due to 
incomplete catalytic machinery (see Figure 2d). GH5_30 
is the only subfamily where all sequences are lacking the 
essential amino acids for GH activity. In addition, some 
members of other groups also appear to have lost their 
catalytic machinery, such as a Xanthomonas-specific 
subgroup that appears to rapidly emerge from subfamily 
GH5_1 (Additional file 1: Figure S1). All three members 
of this subgroup present an architecture where the inactive 
GH5 module is appended to an expansin module 
and an adjacent CBM63 module at the C-terminus. 
Although no catalytic chemical activity has been identified 
for expansin modules, it is significant that the 
knockout of CelXoB renders Xanthomonas oryzae pv. 
oryzae KACC 10331 avirulent [67]. Another emerging 
non-catalytic GH5 subfamily is closely related to subfamily 
GH5_8. All three members of this new subgroup 
are lipoproteins that combine the apparently catalytically 
inactive GH5 module with a large C-terminal extension. 
These two subgroups share long branches in the com- 
mon GH5 tree suggesting rapid evolution. Interestingly, 
a single member of subfamily GH5_48 has lost the catalytic 
machinery. Apart from this loss, it is still similar 
to other members of the subfamily. It is premature to say 
whether this is a snapshot of the early steps of a newly 
arising function or a sequencing error. Finally, the only 
distant GH5 member having lost its catalytic acid-base 
that has been functionally characterized is the putative 
carbohydrate biosensor Rsi24C-GH5 from Clostridium 
thermocellum ATCC 27405 [68]. Although its catalytic 
activity was lost, the extracellular GH5 module was 
shown to interact with crystalline cellulose so that a recognition 
signal could be conveyed to its intracellular 
N-terminal anti-σ factor. The losses of the catalytic 
machinery described here indicate that family GH5 
sequences are also subject to recurring evolution that 
leads to novel functions. This type of evolutionary event 
has been described previously in other GH families. For 
instance, amino acid transporters derived from ancestral 
α-amylases are found in family GH13 [24], inactivated 
chitinases evolved into xylanase inhibitors in family 
GH18 [69], and mammalian lactalbumins, which are 
related to GH22 lysozymes, are all well-known examples 
of the recent evolution of glycosidases to acquire novel 
functionalities [70]. 

Conclusions 

When the first five historical subfamilies of GH5 were 
established in 1990, the total number of GH5 protein 
sequences was 21 [16]. More than 20 years later (August 
2012), this number has increased approximately 150 
times to exceed 3,200. The practical difficulties of hand- 
ling such large datasets notwithstanding, this abundance 
of sequences is both a boon and a bane for phylogenetic 
analysis and functional prediction. 

Assigning proteins to a large GH-family, like GH5, 
which harbors multiple specificities and activities, does 
not unlock the full potential of sequence-based classification. 
Thus, one aim of the present investigation was to 
obtain an improved correlation between protein 
sequences and catalytic specificity by defining a finer 
hierarchical level, the subfamily, for GH5 members. Up 
to 80 percent of the existing GH5 sequences were segre- 
gated into 51 subfamilies. Of these subfamilies, a total of 
31 contained at least one member characterized to some 
degree, whereas 20 lacked enzymatically-characterized 
members altogether. Out of the 31 subfamilies charac- 
terized to some extent, 17 were monospecific and eleven 
were polyspecific (containing two or more enzyme activ- 
ities). Nonetheless, one activity typically predominates 
within polyspecific subfamilies. Interestingly, both endo- 
and exo-acting enzymes have been observed in the same 
subfamily, e.g. GH5_1, GH5_7 and GH5_9, illustrating 
that (if real) the two types of activities reflect details of 
the three-dimensional structures. As a consequence of 
the canonical double displacement mechanism employed 
by GH5 enzymes, which involves the formation of a co- 
valent glycosyl-enzyme intermediate, GH5 members can 
potentially catalyze transglycosylation in addition to, or 
instead of, hydrolysis [71]. Although the amount of bio- 
chemical data is presently limited, we observed that sub- 
family classification in GH5 does not appear to correlate 
with transglycosylation activity, thus indicating that this 
property is also a consequence of subtle protein struc- 
tural details. 

This effort provides a first comprehensive view of the 
coverage and distribution of the curated set of 400 ex- 
perimentally characterized enzymes in the GH5 family 
and couples this information with an extensively up- 
dated sequence-based GH5 subfamily division. In par- 
ticular, it provides insights into the evolution of GH5 
proteins, and the classification results can be used to as- 
sist in candidate protein selection for enzyme discovery 
and bioprospecting projects. For instance, both the 
members from the twenty defined subfamilies lacking 
functional characterization, as well as the numerous 
phylogenetic outliers, provide a vast number of interest- 
ing targets for future studies. In particular, although a 
significant amount of tertiary structural data is already 
available for GH5, the present work highlights that a 
large number of subfamilies would benefit from a 3-D 
structure for at least one subfamily member. Moreover, 
the data presented here, and available at the CAZy data- 
base [11] as a community resource, will serve as a guide 
for protein engineering approaches exploiting the diverse 
activities found within the GH5 family. 

Finally, in the present climate in which sequence data 
is literally flooding public databases, incorrect protein 
function annotations are too easily propagated by auto- 
mated computer-based prediction methods, thereby 
jeopardizing the usefulness of these annotations. Increas- 
ingly rapid sequence accumulation is worsening the sce- 
nario. This problem is particularly illustrated by the 
GH5_11 subfamily of plant and fungal proteins that are 
annotated as cellulases in public databases, including the 
widely-used Arabidopsis Information Resource [72], des- 
pite a complete lack of experimental support for any one 
of its members. Such excesses of over-annotation equally 
affect the presumed non-catalytically active proteins and 
subfamilies. For example, subfamily GH5_30 further exemplifies 
the pitfalls of automated (mis-)annotation: several 
members are publicly annotated as mannanases, 
although they lack the conserved catalytic machinery of 
the family. To avoid such error propagation, we strongly 
advocate designating all predicted enzymes as "GH5_n" 
(where n is the subfamily number) until an activity 
has been rigorously demonstrated by biochemical 
experimentation. 
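
The naming recommendation above can be illustrated with a minimal sketch. The function name, its arguments, and the way a demonstrated EC number is appended to the label are assumptions made for illustration only; they are not part of the CAZy database or of any published annotation tool.

```python
from typing import Optional


def conservative_label(subfamily: Optional[int],
                       demonstrated_ec: Optional[str] = None) -> str:
    """Return a conservative annotation label for a GH5 protein.

    subfamily        -- subfamily number n, or None if the sequence is unclassified
    demonstrated_ec  -- EC number, supplied ONLY when the activity has been
                        demonstrated biochemically (hypothetical convention)
    """
    base = "GH5" if subfamily is None else f"GH5_{subfamily}"
    if demonstrated_ec:
        # Appending the EC number in parentheses is an illustrative choice.
        return f"{base} (EC {demonstrated_ec})"
    return base


# A predicted, uncharacterized member of subfamily 11 is not called a cellulase.
print(conservative_label(11))
# A biochemically characterized enzyme may carry its demonstrated activity.
print(conservative_label(2, "3.2.1.4"))
```

The point of the sketch is only that the activity term is withheld by default and added when, and only when, experimental evidence is passed in explicitly.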

The GH5 subfamily classification presented here pro- 
vides a framework to sort family members into meaning- 
ful, predictive categories. By taking a conservative 
approach to protein annotation, this method offers a 
rigorous strategy to avoid misleading functional predic- 
tion in large-scale genomic sequencing projects. Whilst 
the subfamilies described herein generally act on a single 
substrate (seventeen monospecific subfamilies were 
identified), it is important to stress that precise details of 
glycoside hydrolase function, such as the extent of endo- 
vs. exo- modes of cleavage or the transglycosylation-to-hydrolysis 
ratio, are unlikely to be predictable from sequence 
alone. We therefore recommend that such overreaching 
predictions be altogether abandoned in genomic 
sequence annotation. To aid and advance global 
efforts in de novo sequence annotation, the GH5 sub- 
family classification scheme is now publicly available at 
the CAZy database [11]. 

Methods 

The GH5 subfamilies were defined using the methods 
described for the subfamily classification of GH13 and all 
PL families [24,25], which are briefly summarized here. 
After an initial removal of obviously incomplete and/or er- 
roneous sequences, a total of 2347 full length GH5 cata- 
lytic module sequences were retrieved from the CAZy 
database (October 2011) and subdivided into two sets. 
One set of 414 sequences contained all GH5 modules 
from biochemically characterized enzymes and from sequences 
that tested positive in activity assays against a variety of substrates. The 
second set was composed of 1957 non-characterized 
sequences. The latter subset was clustered at 75% identity 
using UCLUST 4.0, a part of the USEARCH 4.0 package 
[73], and was reduced to 971 sequences. When combined 
with the 414 GH5 module sequences from the biochemically 
characterized and positively assayed subset, we obtained a 
total of 1385 sequences. These sequences were aligned 
using MUSCLE 3.7 [74] in two steps. An initial alignment 
was performed and its quality visually inspected so that 
the remaining incomplete and problem sequences were 
identified and edited or removed. This procedure ensured 
that the GH5 module boundaries were clear and that the 
sequences were trimmed if required, and that a majority 
of the residues constituting the catalytic site was present 
or that the alignment was not ambiguous. A final set of 
1367 remaining sequences were then realigned using the 
same procedure. The eliminated, often fragmentary, 
sequences were used to complement biochemical activity 
information when relevant at later stages. 
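
The redundancy-reduction step described above (clustering at 75% identity and keeping one representative per cluster) can be sketched as a greedy centroid procedure. This is a toy illustration only: it is not the UCLUST algorithm itself, the identity measure assumes pre-aligned strings of equal length, and all function names and sequences are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two pre-aligned, equal-length strings."""
    if len(a) != len(b):
        raise ValueError("toy identity expects pre-aligned sequences of equal length")
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def greedy_cluster(seqs: dict, threshold: float = 0.75) -> dict:
    """Assign each sequence to the first centroid it matches at >= threshold.

    A sequence that matches no existing centroid becomes a new centroid.
    Returns {centroid_name: [member_names]} in order of discovery.
    """
    clusters = {}
    for name, seq in seqs.items():
        for centroid in clusters:
            if identity(seq, seqs[centroid]) >= threshold:
                clusters[centroid].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


# Hypothetical toy sequences (not real GH5 modules).
toy = {
    "s1": "ACDEFGHIKL",
    "s2": "ACDEFGHIKV",  # 90% identical to s1, joins its cluster
    "s3": "ACDEFGHAAA",  # 70% identical to s1, becomes a new centroid
    "s4": "WWWWWWWWWW",  # unrelated, becomes a new centroid
}
clusters = greedy_cluster(toy, threshold=0.75)
representatives = list(clusters)
print(representatives)
```

In the text, the representatives retained after clustering (971) were then pooled with the 414 characterized modules, giving 971 + 414 = 1385 sequences for the first alignment.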

The resulting multisequence alignment of family 
GH5 catalytic domain sequences was used to infer an 
approximate-maximum-likelihood phylogenetic tree with 
FASTTREE 2.1 [75], a program adapted to the analysis of 
large sequence sets, using the Whelan and Goldman (WAG) model of 
amino acid evolution, the gamma option to rescale the 
branch lengths and compute a Gamma20-based likeli- 
hood, a total of four rounds of minimum-evolution moves, 
and options to make the maximum-likelihood nearest- 
neighbor interchanges more exhaustive. The identifica- 
tion and tagging of subfamilies followed a multi-step 
procedure. First, the tree was analyzed to tag the different 
sequences and nodes corresponding to the first 10 
"historical" subfamilies (A1 to A10, described in the 
Introduction). These steps were performed to ensure 
continuity in subfamily definitions. For the remaining 
sequences in the tree, distinct nodes corresponding to 
potential subfamilies were visually identified and their 
consistency checked using a procedure similar to that 
described previously [24]. Within each of these groups, different 
starting sequences were selected and manual BLAST2 
queries [76] performed against all the sequences found in 
CAZy in order to identify self-contained ensembles and 
establish subfamily limits. This analysis ensured that 
each subfamily that was retained was singular and that the 
removal of sequences by the initial UCLUST filtering pro- 
cedure and of fragmentary sequences did not introduce 
any bias. Finally, only subfamilies containing at least five 
sequences from different organisms were considered. 
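
The final retention criterion can be expressed as a simple filter. The sketch below reads "at least five sequences from different organisms" as "sequences from at least five distinct organisms", which is an interpretation; the function name, the tuple layout, and the candidate groups are hypothetical.

```python
from collections import defaultdict


def retain_subfamilies(members, min_organisms: int = 5):
    """Keep candidate groups whose sequences come from >= min_organisms organisms.

    members -- iterable of (group, organism, accession) tuples
    Returns the sorted list of retained group names.
    """
    organisms = defaultdict(set)
    for group, organism, _accession in members:
        organisms[group].add(organism)
    return sorted(g for g, orgs in organisms.items() if len(orgs) >= min_organisms)


# Hypothetical candidates: A spans five organisms, B has six sequences
# but they come from only two organisms.
candidates = (
    [("candidate_A", f"organism_{i}", f"ACC_A{i}") for i in range(5)]
    + [("candidate_B", f"organism_{i % 2}", f"ACC_B{i}") for i in range(6)]
)
print(retain_subfamilies(candidates))
```

Counting organisms rather than raw sequences keeps a group from qualifying merely because one genome carries several near-identical paralogs.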

Subsequently, for each defined subfamily, maximum 
likelihood (ML) phylogenetic trees were built by using 
PhyML [77], and the reliability of the inferred relationships 
in the trees was tested by bootstrap analysis using 
100 resamplings of the data set. 
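
For readers unfamiliar with bootstrap analysis, the resampling step can be sketched as drawing alignment columns with replacement. This is a generic illustration and not PhyML's internal implementation; the function names and the toy alignment are hypothetical, and the tree inference that would follow each replicate is deliberately left out.

```python
import random


def bootstrap_alignment(alignment: dict, rng: random.Random) -> dict:
    """Return one pseudo-replicate: columns sampled with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def bootstrap_replicates(alignment: dict, n: int = 100, seed: int = 0) -> list:
    """Generate n pseudo-replicate alignments (100 in the text)."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n)]


# Hypothetical toy alignment; every row has the same length.
toy_alignment = {
    "seqA": "MKTAYIAK",
    "seqB": "MKSAYLAK",
    "seqC": "MRTAYIGK",
}
replicates = bootstrap_replicates(toy_alignment, n=100, seed=1)
print(len(replicates), len(replicates[0]["seqA"]))
```

A tree would then be inferred from each replicate, and the support for a branch reported as the fraction of replicate trees in which that branch is recovered.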

Endnotes 

a No EC number is presently available for β-1,3-(gluco)mannanase 
activity. 

b No EC number is presently available for arabinoxyla- 
nase activity. 

Additional file 



Additional file 1: Figure S1. Rectangular phylogram view of the 
phylogenetic tree of family GH5. Branches corresponding to subfamilies 
1-53 are shown in color and the individual subfamilies have their 
corresponding subfamily numbers as indicated in Figure 1. The branches 
corresponding to sequences not included in subfamilies are in black. 
Each individual protein module node is identified by a varying number of 
fields separated by "\" indicating: (i) the organism, with 3 letters for the 
genus and either 5 letters for the species or the full strain code; (ii) the 
protein accession in public databases, typically GenBank; (iii) if attributed, 
the subfamily number or other information; (iv) if available, EC numbers 
(node in bold) or a "*" (node in bold and italic) to indicate precise 
enzyme characterizations or simple activity tests, respectively. A suffix 
like "_2" may indicate the module position if more than one GH5 module 
is present on the peptide. Lower confidence nodes with an SH-like local 
support below 0.7 (varying from low 0 to strong 1) are indicated with a 
black dot. Identified sequences without complete catalytic machinery are 
in red. Individual subfamily trees are also included in this file. 
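
The node-label convention described above lends itself to a small parser. The sketch below is an assumption-laden illustration: the separator is taken to be the backslash as printed in the legend, the organism code is treated as an opaque string, and the example labels and accessions are invented rather than taken from the tree file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeLabel:
    organism: str
    accession: Optional[str] = None
    subfamily: Optional[str] = None
    evidence: Optional[str] = None  # EC number(s), "*", or None


def parse_node_label(label: str, sep: str = "\\") -> NodeLabel:
    """Split a tree node label into up to four fields (missing ones become None)."""
    fields = [f.strip() or None for f in label.split(sep)]
    fields += [None] * (4 - len(fields))
    organism, accession, subfamily, evidence = fields[:4]
    return NodeLabel(organism or "", accession, subfamily, evidence)


def evidence_kind(label: NodeLabel) -> str:
    """Classify the fourth field following the legend's convention."""
    if label.evidence is None:
        return "none"
    if label.evidence == "*":
        return "activity test"
    return "EC characterization"


def is_low_support(sh_like: float, threshold: float = 0.7) -> bool:
    """True for nodes that the legend marks with a black dot (support below 0.7)."""
    return sh_like < threshold


# Invented labels for illustration only.
example = parse_node_label(r"Abcdefgh\XYZ00001\GH5_2\3.2.1.4")
print(example.subfamily, evidence_kind(example))
```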